
Local Aware IBM #1378

Open

danieljvickers wants to merge 66 commits into MFlowCode:master from danieljvickers:local-aware-ibm

Local Aware IBM#1378
danieljvickers wants to merge 66 commits into
MFlowCode:masterfrom
danieljvickers:local-aware-ibm

Conversation

@danieljvickers
Member

Description

The current code has trouble scaling much past 10k particles due to limitations of the MPI all-reduce sum in the IB force computation. This PR attempts to alleviate that by limiting the IBs any given rank is aware of to those in its own and neighboring ranks' subdomains. This turns the all-reduce into an MPI neighbor communication, removing the communication bottleneck. To support this, a massive overhaul of IB ownership between ranks was required.

Type of change

  • Refactor

Testing

TBD

Checklist

  • I added or updated tests for new behavior

AI code reviews

Reviews are not triggered automatically. To request a review, comment on the PR:

  • @coderabbitai review — incremental review (new changes only)
  • @coderabbitai full review — full review from scratch
  • /review — Qodo review
  • /improve — Qodo code suggestions
  • @claude full review — Claude full review (also triggers on PR open/reopen/ready)
  • Add label claude-full-review — Claude full review via label

@github-actions

github-actions Bot commented Apr 23, 2026

Claude Code Review

Head SHA: 9879df3

Files changed: 21, including:

  • src/simulation/m_ibm.fpp
  • src/simulation/m_start_up.fpp
  • src/simulation/m_collisions.fpp
  • src/simulation/m_global_parameters.fpp
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/common/m_mpi_common.fpp
  • src/post_process/m_data_output.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_time_steppers.fpp

Findings:

Critical: Stack overflow in s_reduce_ib_patch_array (src/simulation/m_start_up.fpp)

subroutine s_reduce_ib_patch_array()
    type(ib_patch_parameters), dimension(num_ib_patches_max) :: patch_ib_gbl

patch_ib_gbl is a local (stack-allocated) variable dimensioned by num_ib_patches_max, which was just raised from 50,000 to 2,050,000 in src/common/m_constants.fpp. ib_patch_parameters contains a character(LEN=pathlen_max) field (pathlen_max=400) plus ~500 bytes of numeric fields — roughly 900 bytes per element. At 2,050,000 elements this is ~1.8 GB on the call stack, which will segfault on every platform (typical stack limits are 8–64 MB). This variable must be heap-allocated (allocate/deallocate).
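
A minimal sketch of the heap-allocated form (assuming the rest of the subroutine is unchanged):

subroutine s_reduce_ib_patch_array()
    type(ib_patch_parameters), dimension(:), allocatable :: patch_ib_gbl

    allocate (patch_ib_gbl(num_ib_patches_max))   ! ~1.8 GB now lives on the heap, not the stack
    ! ... existing reduction logic ...
    deallocate (patch_ib_gbl)
end subroutine s_reduce_ib_patch_array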


High: Debug print left in production code (src/simulation/m_ibm.fpp)

print *, proc_rank, " New Owner ", patch_ib(k)%gbl_patch_id  ! TODO :: REMOVE THIS DEBUG PRINT

This fires on every ownership transfer for every locally-tracked IB, generating unbounded output at scale. The ! TODO :: REMOVE THIS DEBUG PRINT marker confirms it is unintentional.


Medium: Commented-out code in m_collisions.fpp leaves pid2 lookup absent (src/simulation/m_collisions.fpp)

! call s_get_neighborhood_idx(pid1, pid1) ! global patch ID -> local index call s_get_neighborhood_idx(pid2, pid2)
if (pid1 <= 0 .or. pid2 <= 0) cycle

The comment text contains call s_get_neighborhood_idx(pid2, pid2) — what appears to be a second intended lookup that was accidentally folded into the comment instead of being left as executable code. Neither lookup actually executes: pid1 and pid2 are the raw decoded global IDs from s_decode_patch_periodicity, not local indices. The guard if (pid1 <= 0 .or. pid2 <= 0) cycle would never trigger for valid global IDs (which are ≥ 1), meaning the subsequent patch_ib(pid1) and patch_ib(pid2) accesses use global IDs as local array indices, which is out-of-bounds when num_ibs < num_gbl_ibs. If this is intentional (global IDs equal local indices in the no-MPI / single-rank path), that should be documented; otherwise both lookups need to be uncommented.
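
If the lookups are intended, the fix is presumably to restore both calls as executable code (sketch; assumes s_get_neighborhood_idx overwrites its second argument with the local index, or a non-positive value when the patch is not locally tracked):

call s_get_neighborhood_idx(pid1, pid1) ! global patch ID -> local index
call s_get_neighborhood_idx(pid2, pid2)
if (pid1 <= 0 .or. pid2 <= 0) cycle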


Low: GPU_UPDATE(device=[patch_ib(ib_idx)%moment]) removed without replacement (src/simulation/m_ibm.fpp)

-            patch_ib(ib_marker)%moment = moment*patch_ib(ib_marker)%mass/(count*cell_volume)
-            $:GPU_UPDATE(device='[patch_ib(ib_marker)%moment]')
+            patch_ib(ib_idx)%moment = moment*patch_ib(ib_idx)%mass/(count*cell_volume)

The per-field device update was removed. If patch_ib(i)%moment is consumed on the GPU before the next full GPU_UPDATE(device=[patch_ib(1:num_ibs)]) (e.g., in a subsequent call to s_ibm_correct_state within the same time step), the device will use a stale value. Verify that a covering full-struct update always precedes any GPU use of %moment after s_compute_moment_of_inertia is called from s_propagate_immersed_boundaries.

@sbryngelson
Member

Claude Code Review

PR #1378 — Local Aware IBM

This is a draft PR. The concept (replacing global allreduce with neighborhood-local communication for IBM forces) is sound and the scalability motivation is clear. Several correctness issues need to be addressed before this is merge-ready.


Critical Issues

1. f_local_rank_owns_location returns uninitialized result for single-process MPI builds
src/simulation/m_collisions.fpp

The new function wraps its entire ownership logic in #ifdef MFC_MPI / if (num_procs > 1) but provides no else branch. If num_procs == 1 inside an MPI build, owns_collision is never set, producing an uninitialized logical. The old f_local_rank_owns_collision handled this correctly by setting owns_collision = .true. when num_procs == 1. Fix: add an else branch that sets owns_collision = .true., as sketched below.
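
Sketch of the suggested fix (surrounding structure assumed from the old function):

#ifdef MFC_MPI
    if (num_procs > 1) then
        ! ... existing neighborhood ownership logic ...
    else
        owns_collision = .true.   ! single rank owns every collision
    end if
#else
    owns_collision = .true.       ! no-MPI build
#endif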

2. f_neighborhood_ranks_own_location silently disables locality check for 1–2 rank runs
src/simulation/m_collisions.fpp

The guard if (num_procs > 2) means that for num_procs == 2 the function unconditionally returns .true., so every IB is treated as neighborhood-owned with no filtering. Two-rank runs are the canonical small test configuration. The threshold should be > 1, not > 2.
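
i.e., a one-character fix (sketch of the assumed surrounding guard):

if (num_procs > 1) then   ! was: num_procs > 2, which disabled filtering for 2-rank runs
    ! ... existing locality check ...
end if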

3. Out-of-bounds potential in s_reduce_ib_patch_array
src/simulation/m_start_up.fpp

In 3D with 27 neighbor cells, num_aware_ibs can reach 27 * (local IB count), which for any non-trivial simulation exceeds num_local_ibs_max = 2000. There is no @:PROHIBIT or bounds guard before indexing local_ib_patch_ids(1:num_local_ibs_max). Silent memory corruption or segfault will occur once num_aware_ibs > 2000.
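
A guard in the project's existing macro style would catch this before corruption (sketch; exact message wording is a suggestion):

@:PROHIBIT(num_aware_ibs > num_local_ibs_max, "num_aware_ibs exceeds num_local_ibs_max")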

4. Analytic MIBM code generation uses local loop index, not global patch ID
toolchain/mfc/case.py — __get_analytic_mib_fpp

The template still emits if (i == {pid}) where pid is the global patch number from the Python case file and i is the local loop counter over num_ibs. After this PR, num_ibs counts locally-owned IBs, so local index i no longer maps to the global patch number. All simulations with analytically-defined IB motion will silently apply velocity to the wrong patches in multi-rank runs.
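
One plausible fix is for the template to compare the stored global ID rather than the loop counter (sketch; assumes gbl_patch_id is populated before the emitted code runs):

if (patch_ib(i)%gbl_patch_id == {pid}) then
    ! ... emitted analytic motion for this patch ...
end if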


Important Issues

5. error stop in s_compute_image_points
src/simulation/m_ibm.fpp:485

if (bounds_error) error stop "Ghost Point and Image Point on Different Processors. Exiting"

error stop is forbidden by project standards (CLAUDE.md). Must be replaced with call s_mpi_abort(...).
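
For example (assuming s_mpi_abort accepts a message string, as used elsewhere in the codebase):

if (bounds_error) call s_mpi_abort("Ghost Point and Image Point on Different Processors. Exiting")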

6. models array allocated without matching @:DEALLOCATE
src/simulation/m_ibm.fpp:56

s_initialize_ibm_module has @:ALLOCATE(models(num_ibs)) but s_finalize_ibm_module has no matching @:DEALLOCATE(models). This violates the project rule requiring every @:ALLOCATE to have a paired @:DEALLOCATE, and leaks GPU device memory.

7. Post-process s_write_ib_variable called from all ranks with orphaned mesh
src/post_process/m_data_output.fpp

The PR removes the if (proc_rank == 0) guard, moving DBPUTPM (mesh definition) to rank 0 only but leaving DBPUTPV1 (variable write, inside s_write_ib_variable) called on all ranks. Non-zero ranks will write a point variable onto a mesh that was never defined in their per-process SILO file, producing corrupt output or a library error.

8. normal_axis normalization bug in s_compute_moment_of_inertia
src/simulation/m_ibm.fpp:1154

normal_axis = axis/sqrt(sum(axis))   ! BUG: missing **2

Should be sqrt(sum(axis**2)). As written this is sqrt(sum of components) rather than Euclidean norm, corrupting moment-of-inertia calculation for any 3D rotating IB. Pre-existing issue but touched by this PR.
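
The corrected line (norm2 is the equivalent Fortran 2008 intrinsic):

normal_axis = axis/sqrt(sum(axis**2))   ! or: axis/norm2(axis)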


Suggestions

  • num_local_ibs_max = 2000: add @:PROHIBIT(num_local_ibs > num_local_ibs_max, "num_local_ibs exceeds num_local_ibs_max") at the point of use.
  • Mixed #:if defined('MFC_MPI') vs #ifdef MFC_MPI within the same module — pick one for consistency.
  • Testing is marked "TBD". At minimum the existing 2D IBM/MIBM regression tests should be shown passing and a multi-rank (≥4) run with >10 IBs demonstrated before leaving draft.

Strengths

  • The architectural direction (local-only IB awareness replacing global allreduce) is the correct approach for scaling to O(10⁴) particles.
  • The s_encode_patch_periodicity/s_decode_patch_periodicity interface cleanly maintains global identity through the periodic marker field.
  • The 26-neighbor rank lookup table in s_compute_ib_neighbor_ranks avoids repeated topology queries at runtime.
  • The post-process SILO simplification (single mesh instead of per-rank mesh) is a positive cleanup.

@sbryngelson
Member

Automated Code Review

Summary: 6 critical issues, 3 important issues. (Draft PR)

Critical Issues

1. error stop prohibited — use s_mpi_abort() instead (confidence: 100%)

src/simulation/m_ibm.fpp, s_compute_image_points:

if (bounds_error) error stop "Ghost Point and Image Point on Different Processors. Exiting"

error stop bypasses MPI finalization, hanging other ranks. Replace with call s_mpi_abort() or @:PROHIBIT(bounds_error, "...").


2. Missing @:DEALLOCATE for models array (confidence: 100%)

s_initialize_ibm_module allocates models via @:ALLOCATE(models(num_ibs)) but s_finalize_ibm_module has no matching @:DEALLOCATE(models). Since @:ALLOCATE also emits GPU_ENTER_DATA(create=...), the GPU memory is leaked. Per CLAUDE.md: every @:ALLOCATE must have a matching @:DEALLOCATE.


3. Missing @:DEALLOCATE for ghost_points array (confidence: 100%)

s_ibm_setup allocates ghost_points via @:ALLOCATE(ghost_points(1:max_num_gps)) and separately emits GPU_ENTER_DATA(copyin=...). Neither s_finalize_ibm_module nor s_update_mib frees it. Add @:DEALLOCATE(ghost_points) to finalization.
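
A sketch of the missing finalization for both this and the models array above (placement among existing deallocations assumed):

subroutine s_finalize_ibm_module()
    ! ... existing deallocations ...
    @:DEALLOCATE(models)
    @:DEALLOCATE(ghost_points)
end subroutine s_finalize_ibm_module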


4. Dead expression in pressure correction loop — result silently discarded (confidence: 100%)

m_ibm.fpp, s_ibm_correct_state:

q_prim_vf(eqn_idx%E)%sf(j, k, l) + pres_IP/(1._wp - ...)

This is a bare expression — not an assignment. The pressure field is never updated. For moving_ibm > 1, pressure is always left at 0._wp from the zero-initialization line above. The commented-out code directly below confirms the intended form is an assignment:

q_prim_vf(eqn_idx%E)%sf(j, k, l) = q_prim_vf(eqn_idx%E)%sf(j, k, l) + pres_IP/(...)

5. MPI tag accumulation across phases in s_communicate_ib_forces (confidence: 95%)

The back-propagation Fypp loop does not reset tag = 300 — it continues from wherever the accumulation phase left off. For num_dims=2, accumulation x uses tag 300, y uses 302; back-propagation x starts at tag 304 (not 300). Since both sender and receiver apply the same increment, tags happen to match for symmetric topologies — but the protocol is fragile and non-obvious. Reset tag to a distinct base at the start of each phase (e.g., 300 for accumulation, 400 for back-propagation).
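
Sketch of the proposed scheme (phase structure assumed from the finding):

tag = 300   ! accumulation phase base
! ... accumulation sends/recvs, incrementing tag per dimension ...
tag = 400   ! back-propagation phase base, reset explicitly so the phases can never collide
! ... back-propagation sends/recvs ...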


6. s_reduce_ib_patch_array never sets gbl_patch_id on single-rank / no-MPI path (confidence: 90%)

The #else (no-MPI) and num_procs == 1 branches assign patch_ib(:) = patch_ib_gbl(1:num_aware_ibs) but never set patch_ib(i)%gbl_patch_id = i. The multi-rank MPI path sets this explicitly. Although s_communicate_ib_forces returns early for num_procs == 1, s_get_neighborhood_idx with gbl_patch_id = 0 would never find a match, silently corrupting any path that reaches it. Fix: add patch_ib(i)%gbl_patch_id = i in the single-rank path.
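
Sketch of the fix on the single-rank path (loop variable assumed):

do i = 1, num_aware_ibs
    patch_ib(i)%gbl_patch_id = i   ! on one rank, the local index doubles as the global ID
end do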


Important Issues

7. requests(52) fixed array may overflow when ax >= 3 (confidence: 85%)

s_compute_ib_neighbor_ranks: nreqs is never reset to 0 between iterations of do k = 2, ax. On the second round, nreqs carries over from the first, incrementing past index 52 — out-of-bounds write. Fix: reset nreqs = 0 at the top of each k iteration.
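
Sketch:

do k = 2, ax
    nreqs = 0   ! reset per round; otherwise requests(nreqs) indexes past 52
    ! ... post sends/recvs, incrementing nreqs ...
end do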

8. De-duplication scheme in force accumulation untested for ib_neighborhood_radius > 1 (confidence: 80%)

The snapshot-based force de-duplication (subtract previous snapshot, add new) is sound for radius=1 but relies on each IB appearing at most once per received message. For radius > 1 with non-linear neighborhood topology, duplicate accumulation is possible. Needs explicit testing.

9. recv_forces_snap reset is per-dimension but inner loop overwrites unconditionally (confidence: 80%)

Minor design concern for the radius > 1 case — the accumulation loop structure may not generalize cleanly beyond the tested radius=1 case.

Suggestions

  • Debug print * left in s_handoff_ib_ownership: print *, proc_rank, " New Owner ", ... ! TODO :: REMOVE THIS DEBUG PRINT — remove before merge
  • Commented-out ! print *, proc_rank, num_local_ibs, ... in s_ibm_setup — remove before merge

Strengths

  • Neighborhood rank discovery protocol is well-structured (face→edge→corner seeding with distinct tag range)
  • gbl_patch_id matching across ranks is architecturally sound for the moving IB migration case
  • storage_size(0)/8 used correctly for byte sizing (no hardcoded widths)
  • GPU_* Fypp macros used consistently throughout — no raw !$acc/!$omp directives found

@danieljvickers danieljvickers marked this pull request as ready for review May 11, 2026 18:30
@qodo-code-review
Contributor

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.
